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Greater  computational  power  is  needed  for  solving  Computational  Fluid  Dynamics  (CFD) 
problems  of  interest  in  engineering  design.  Parallel  architecture  computers  offer  the  promise 
of  providing  orders  of  magnitude  greater  computational  power.  In  this  paper  we  quantify 
that  promise  by  considering  an  explicit  CFD  method  and  analyze  the  potential  parallelism 
for  three  different  parallel  computer  architectures.  The  use  of  an  explicit  method  gives 
us  a  “best  case”  analysis  from  the  point  of  view  of  parallelism,  and  allows  us  to  uncover 
potential  problems  in  exploiting  significant  parallelism.  The  analysis  is  validated  against 
experiments  on  three  representative  parallel  computers.  The  results  allow  us  to  predict 
the  performance  of  different  parallel  architectures.  In  particular,  our  results  show  that 
distributed  memory  parallel  processors  offer  greater  potential  speedup.  We  discuss  the  im¬ 
port  of  our  model  for  the  development  of  parallel  CFD  algorithms  and  parallel  computers. 
We  also  discuss  our  experiences  in  converting  our  model  code  to  run  on  the  three  different 
parallel  computers. 
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Introduction 


Computational  Fluid  Dynamics  (CFD)  problems  are  among  the  most  demanding  scientific 
computing  problems  in  terms  of  the  computational  resources  they  require.  Currently, 
CFD  problems  are  extensively  solved  on  vector  supercomputers,  primarily  Cray  and  Cyber. 
While  current  supercomputers  have  adequate  computational  power  to  solve  CFD  problems 
which  are  three  dimensional  or  unsteady  or  viscous,  extending  CFD  to  problems  which  are 
three  dimensional  and  unsteady  and  viscous  requires  vastly  more  computational  power. 

One  example  of  a  computationally  intensive  CFD  problem  is  modeling  the  air  flow 
through  an  airplane’s  gas  turbine  engine.  The  flow  through  a  gas  turbine  engine  is  three 
dimensional,  unsteady  and  viscous.  Further  the  geometry  is  extremely  complex  including 
several  turbine  stages,  several  compressor  stages  and  a  combustor.  Gas  turbine  engineers 
must  design  the  entire  engine  and  would  simulate  the  flow  through  the  entire  engine  if  the 
capability  were  available.  As  an  example  of  the  computational  intensity  of  CFD  calcula¬ 
tions,  Rai  recently  took  over  100  hours  on  a  Cray  X-NiP  to  run  a  3-dimensional  viscous 
unsteady  model  for  the  air  flow  past  two  stator  blades  and  a  single  rotor  blade  of  an  axial 
flow  gas  turbine  [Rai  87].  Although  this  run  time  is  far  too  excessive  for  the  use  of  such 
a  code  in  enginoering  design,  only  part  of  the  gas  turbine  was  modeled.  As  the  Cray  is 
on  of  the  fastest  machines  in  the  world  and  near  the  physical  limits  on  the  speed  of  a 
uni-processor,  no  existing  computer  is  fast  enough  for  these  calculations. 

To  speed  needed  calculations  up  several  orders  of  magnitude  faster  supercomputers  are 
needed  in  gas  turbine  engine  design.  Computer  processors,  however,  are  reaching  their 
physical  speed  limits.  Processor  designs  appear  to  be  within  a  single  order  of  magnitude  of 
their  speed  limits  due  to  physical  limitations  such  as  the  speed  of  light  and  gate  switching 
limits.  Therefore  needed  throughput  cannot  be  achieved  with  single  processor  computers. 

Parallel  processor  computers  offer  the  possibility  of  achieving  the  throughput  needed 
in  CFD.  Researchers  are  now  investigating  the  effectiveness  of  using  parallel  computers 
for  the  solution  of  CFD  problems  [Jame  87, John  87).  The  need  to  understand  and  exploit 
the  architecture  of  parallel  computers,  however,  makes  it  unclear  whether  we  can  design 
parallel  algorithms  which  will  achieve  the  needed  throughput  for  CFD  problems. 

To  test  the  promise  of  parallel  computers,  we  consider  a  model  CFD  problem  and 
apply  it  to  representatives  of  several  types  of  parallel  computers.  This  model  is  bjised  on 
an  explicit  difference  technique  and  thus  hais  the  greatest  opportunities  for  parallelism  (ais 
compared  to  implicit  difference  techniques).  For  each  class  of  parallel  computer,  we  develop 
a  time  complexity  model  and  validate  it  against  our  model  problem.  These  complexity 
estimates  are  analytical  estimates  of  the  computer  time  needed  to  solve  a  CFD  problem. 
The  complexity  estimates  are  expressed  in  terms  of  bzisic  machine  and  problem  parameters, 
such  as  floating  point  and  communication  speeds  and  number  of  grid  points.  From  these 
estimates,  we  can  estimate  the  performance  of  CFD  codes  on  supercomputers  utilizing 
significant  parallelism. 


Finally,  we  discuss  our  experiences  in  using  parallel  computers  for  solving  CFD  prob¬ 
lems. 

1  Architectures 

A  number  of  different  types  of  parallel  architecture  computers  are  available  today.  These 
run  from  very  fine  grain  machines  such  as  vector  processors  and  very  long  instruction 
word  machines  to  large  numbers  of  almost  independent  processors.  In  this  paper,  we 
will  consider  three  architectures  which  represent  an  important  part  of  the  spectrum  of 
possible  parallel  computers.  These  are  multiple  vector  processors,  tightly  coupled  MIMD 
and  loosely  coupled  MIMD. 

Multiple  vector  processors  consist  of  several  vector  computers  operating  in  parallel. 
Exchanges  of  information  and  synchronization  between  the  processors  is  facilitated  by 
special  high-speed  hardware.  Examples  are  the  Cray  X-MP,  Cray  2,  ETA  10,  and  the 
Alliant  FX/8,  that  last  of  which  was  used  in  this  study. 

Tightly  and  loosely  coupled  MIMD  computers  work  on  essentially  independent  data  in 
parallel,  with  differing  kinds  of  access  to  each  other’s  data. 

Tightly  coupled  MIMD  computers  are  collections  of  independent  processors  which  are 
closely  connected,  usually  by  shared  memory.  A  parallel  program  running  on  this  type 
of  machine  consists  of  independent  threads  of  control  which  may  access  each  other’s  data 
and  can  tightly  control  each  other’s  operation  by  manipulating  this  shared  data. 

The  advantage  of  the  tightly  coupled  MIMD  approach  is  that  there  is  no  single  thread 
of  control,  and  hence  no  a-priori  serial  execution.  There  have  been  a  number  of  machines 
of  this  type  built  by  recent  startup  companies,  such  as  the  Encore  Multlmax  120,  used  in 
this  study  and  the  Sequent  Balance  series. 

Loosely  coupled  MIMD  machines  are  collections  of  independent  processors  which  com¬ 
municate  through  some  reliable  mechanism,  but  which  don’t  directly  share  any  data.  In¬ 
stead,  all  interprocessor  sharing  of  data  is  done  by  I/O  operations,  typically  the  sending 
and  receiving  of  message  packets.  This  provides  a  measure  of  programming  safety  and  re¬ 
producibility  of  results  often  absent  in  shared  memory  (tightly  coupled  MIMD)  machines, 
since  all  modifications  to  “shared”  data  structures  is  handled  explicitly  by  the  programmer, 
rather  thzm  implicitly  through  a  shared  memory  access.  However,  most  systems  currently 
on  the  market  have  a  high  overhead  associated  with  interprocessor  communication. 

Example  machines  of  this  type  include  the  various  hypercubes  from  Intel,  NCUBE,  and 
others,  and  some  research  machines,  such  as  the  LCAP  system  of  IBM.  The  systems  them¬ 
selves  vary  from  a  few,  fast  processors  (LCAP)  to  many  slow  processors  (Intel  Hypercube). 
The  Intel  Hypercube  was  used  in  this  study. 


2  Finite  Difference  Algorithms 

Finite  difference  algorithms  used  in  Computational  Fluid  Dynamics  for  time  accurate  cal¬ 
culations  can  be  divided  into  two  major  types:  explicit  and  implicit. 

With  explicit  finite  difference  algorithms  the  value  of  a  variable  at  the  new  time  is 
determined  from  the  values  of  the  variables  at  the  old  time  directly  (that  is  explicitly) 
without  dependence  on  the  values  at  the  new  time.  Practically  speaking  this  means  that 
the  values  of  the  variables  can  be  determined  without  solving  a  system  of  equations.  The 
time  step  size  that  can  be  taken  is  limited  by  a  time  step  on  the  order  of  the  Courant- 
Freidrichs-Lewy  (CFL)  criterion  (the  smallest  value  of  the  time  for  a  particle  traveling  at 
the  local  fluid  velocity  to  cross  a  finite  difference  grid  interval).  In  many  applications  the 
CFL  time  step  is  much  smaller  (often  orders  of  magnitude  smaller)  than  the  time  step 
required  for  accurate  calculation  of  the  time  variation  of  the  variables. 

The  model  problem  studied  here  uses  MacCormack’s  algorithm  [MacC  69]  which  was 
chosen  for  its  simplicity  and  wide  familiarity  to  the  CFD  community.  Major  features  of 
MacCormack’s  algorithm  are: 

•  An  explicit  difference  scheme  which  is  second  order  accurate  in  space  and  first  order 
accurate  in  time. 

•  A  two  step  scheme  in  which  the  present  time  values  are  used  to  calculate  the  inter¬ 
mediate  values  in  the  first  step,  and  the  present  time  and  intermediate  values  are 
used  to  calculate  the  values  at  the  new  time. 

Implicit  finite  difference  algorithms  differ  from  explicit  finite  difference  algorithms  in 
that  the  determination  of  the  values  of  the  variables  at  the  new  time  depends  (implicitly) 
on  the  values  of  the  variables  at  the  new  time.  The  advantage  of  a  well-chosen  implicit 
scheme  is  that  numerical  stability  can  be  achieved  with  a  time  step  size  much  greater  than 
given  by  the  explicit  CFL  limit  [Jame  83]. 

The  price  to  be  paid  by  most  implicit  schemes  is  that  the  solution  of  a  system  of 
equations  is  required  at  each  step.  Except  for  the  solution  of  a  tightly  banded  linear 
system  it  is  often  faster  to  solve  the  linear  system  iteratively. 

We  left  implicit  schemes  for  later  study  in  order  to  address  the  basic  question  of  whether 
parallel  architecture  computers  were  applicable  to  solving  CFD  problems.  We  intend  to 
study  implicit  schemes  and  linear  solution  techniques  in  future  work.  The  use  of  sparse  ma¬ 
trix  methods  in  CFD  problems  is  an  area  of  current  research  on  serial  computers  [Wigt  85] 
2is  well  as  on  parallel  computers. 

3  Model  Problem 

Descriptions  of  the  physical  model  problem,  its  geometry,  its  mathematical  formulation, 
and  the  numerical  solution  method  follow. 
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The  physical  model  problem  we  consider  is  unsteady  compressible  viscous  flow  through 
an  insulated  duct.  We  further  assume  the  flow  to  be  axisymmetric  with  swirl.  A  time 
dependent  model  problem  was  chosen  for  this  study  since  time  marching  schemes  are 
generally  used  in  CFD  for  both  steady  and  unsteady  problems.  A  two  dimensional  problem 
was  chosen  both  to  provide  a  start  at  physically  reasonable  models  for  combustor  flow 
and  to  introduce  the  complexities  of  multi-dimensional  problems.  For  the  model  problem 
studied  here  the  fluid  velocities  and  pressure  are  known  so  that  the  temperature  can  be 
obtained  by  integrating  in  time  the  energy  equation  which  is  linear  in  temperature.  The 
density  was  obtained  from  the  perfect  gas  law. 

The  case  studied  was  obtained  from  the  model  problem  by  «issuming  Poiseuille  flow 
(parabolic  velocity  profile  and  linear  pressure  gradient)  and  treating  the  residual  terms  as 
a  heat  source.  Poiseuille  flow  is  an  exact  solution  for  steady,  incompressible,  isothermal, 
viscous  pipe  flow.  We  wished  to  retain  the  compressible  effects  in  case  they  affected  the 
parallelization  of  the  code.  Poiseuille  flow  does  not  exactly  satisfy  the  compressible  flow 
equations.  There  are  residual  terms  which  are  collectively  treated  as  a  heat  source.  Thus 
the  solution  to  the  case  studied  was  steady  Poiseuille  flow  and,  in  particular,  constant 
temperature.  This  facilitated  debugging. 

The  geometry  of  the  model  problem  is  a  rectangular  domain  with  a  tensor  product 
mesh.  The  left  side  is  the  inflow  and  the  right  side  is  the  outflow.  The  top  is  the  insulated 
wall,  and  the  bottom  is  the  centerline  of  the  cylinder. 

The  mathematical  formulation  of  the  energy  equation  in  cylindrical  coordinates  is  given 
by  equations  1  through  6. 
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Q  is  the  heat  source  term.  In  the  case  studied  Q  was  set  equal  to  the  right  hand  side  of 
equation  1  evaluated  for  Poiseuille  flow 
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Density  is  determined  from  the  equation  of  state  (perfect  gas  law) 

p  ~  t 


(5) 

(6) 


and  T  is  the  unknown  to  be  solved  for.  To  simplify  debugging  the  solution  T  =  Tq  (Tq  a 
constant)  was  used. 

The  numerical  solution  method  for  the  above  partial  differential  equations  is  by  Mac- 
Cormack’s  explicit  algorithm  [MacC  69].  The  finite  difference  form  of  these  equations  using 
MacCormack’s  algorithm  is  given  by  equations  7  and  8. 

In  the  first  step  of  MacCormack’s  two  step  difference  scheme  the  derivatives  of  F  and  H 
are  calculated  using  backward  differences  with  central  differences  for  the  terms  involving 
second  derivatives. 
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In  the  second  step  of  MacCormack’s  scheme  the  derivatives  of  F  and  H  are  calculated 
using  forward  differences  with  central  differences  for  the  terms  involving  second  derivatives. 
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In  CFD  studies  of  duct  flow  typical  grid  sizes  are  order  of  50  to  100  (radial)  by  100  to 
1000  (axial),  and  the  CFL  limited  time  ste^^s  generally  number  in  the  tens  of  thousands. 
In  this  study  we  used  matrices  from  50  x  50  to  250  x  250  in  size  and  took  from  10  to  50 
time  steps.  The  matrix  sizes  were  chosen  to  test  effects  of  memory  size  and  the  number  of 
time  steps  was  chosen  large  enough  to  get  accurate  timings. 

The  boundary  conditions  are  specified  inflow  and  axially  smooth  outflow  with  insulated 
cylinder  walls.  The  geometry  was  chosen  to  keep  the  code  simple  for  transporting  from 
one  parallel  machine  to  another. 
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4  Time  Complexity  Model 

In  order  to  understand  the  potential  of  parallel  programming  for  CFD  problems,  we  need 
to  develop  models  which  will  allow  us  to  predict  the  performance  of  a  parallel  computer. 
Time  complexity  models  are  constructed  for  each  of  the  three  clcisses  of  parallel  computers 
used  in  this  study.  Since  the  time  complexity  model  is  an  analytic  expression  for  the 
computer  time  needed  to  solve  a  problem,  based  on  fundamental  parameters  of  the  model 
problem  and  the  computer,  we  can  use  the  complexity  estimates  to  predict  the  performance 
of  new  parallel  computers  and  suggest  which  architectures  offer  the  most  potential  for  CFD 
problems. 

In  this  section  we  will  first  give  some  basic  definitions,  then  explain  superlinear  speedup. 
Then  we  will  give  time  complexity  estimates  for  simple  stencils  and  apply  and  extend  these 
estimates  to  our  model  problem. 

We  will  denote  the  time  complexity  by  T(p,  n),  where  p  is  the  number  of  processors 
and  the  problem  is  on  an  n  x  n  mesh.  Another  useful  measure  is  the  speedup,  s,  defined 
by  5(p, n)  =  T{l,n)  jT{p,n).  In  a  perfectly  parallel  algorithm  with  perfect  hardware  and 
software,  T{p,n)  =  T(l,n)/p.  Algorithms,  hardware,  and  software  are  not,  however, 
perfect  and  this  section  is  concerned  with  modeling  the  “imperfections.”  We  therefore 
define  the  computational  efficiency  as  ^(l,  n)/(pr(p,  n)).  A  perfect  parallel  computer  and 
algorithm  has  a  computational  efficiency  of  1. 

The  nature  of  these  imperfections  depends  on  the  type  of  parallel  computer.  However, 
they  are  related  to  the  access  to  memory.  In  a  loosely-coupled  MIMD  parallel  processor, 
it  is  the  interprocessor  communication  time,  which  can  be  thought  of  as  access  to  remote 
memory,  which  is  important.  In  a  tightly-coupled  MIMD  parallel  processor,  it  is  the  access 
to  shared  memory,  and  the  establishment  of  critical  regions  (access  control  to  memory) 
which  is  important.  In  a  multiple  vector  parallel  processor,  it  is  the  access  to  shared 
memory  again  and  cache  contention  among  the  processors  which  is  important.  In  addition, 
various  “non-parallel”  operations  may  reduce  the  possible  speedup. 

4.1  Superlinear  speedup 

As  an  example  of  how  memory  access  can  affect  the  time  complexity  on  parallel  computers, 
we  discuss  a  commonly  misunderstood  phenomena  called  superlinear  speedup.  Superlin¬ 
ear  speedup  is  a  speedup  higher  than  the  speedup  of  a  “perfect”  parallel  computer  and 
algorithm — that  is,  a  computational  efficiency  greater  than  1.  This  effect  is  real,  and  it  is 
caused  by  “non-linear”  features  in  the  description  of  the  parallel  computer. 

In  particular,  it  is  evident  if  the  process  of  breaking  the  program  into  a  number  of 
parallel  pieces  produces  smaller  pieces  which  can  use  the  hardware  more  effectively. 

An  example  is  a  machine  where  each  processor  has  a  memory  cache  of  size  C  ;  let  the 
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time  to  run  a  problem  of  size  m  on  p  processors  be 


Tm{p)  =  —  +ff(m,p) 
P 


where 


—  f  ^  P  if  ^  P  >  c 
g{m,p)  -  \  .v  • 

1 0  otherwise. 


The  g  represents  the  time  to  fill  the  cache  from  memory  if  the  cache  can  not  hold  the  entire 
problem  (i.e.,  Tm(l)  =  2m)  then,  as  p  — *•  oo, 


T„(l) 

VTm(p) 


twice  the  “perfect”  speedup. 

Now,  if  instead  we  insist  that  m  be  “large  enough”,  ie.,  m  >  pC,  then  we  leave  the 
regime  where  the  problem  fits  in  the  cache,  and  the  speed  up  is  linear; 


Tmjl)  ,  , 
pTmip) 

as  both  p  and  m  go  to  infinity. 

This  effect  is  important  in  modern  parallel  computers  because  the  relevant  “cache  size” 
has  become  very  large.  For  example,  in  a  tightly-coupled  MIMD  shared  memory  machine 
like  the  BBN  Butterfly,  the  cost  for  accessing  non-local  memory  is  roughly  three  times  that 
of  local  memory.  Thus  the  entire  local  memory  (several  megabytes)  can  be  considered  the 
“cache”  for  the  purposes  of  this  argument.  This  effect  can  generate  anomalous  results  for 
small  problems. 


4.2  Time  complexity  estimates  for  simple  stencils 

In  order  to  motivate  our  choice  of  complexity  model,  we  first  develop  a  model  for  finite 
difference  schemes  with  5  and  9  point  stencils  for  the  three  parallel  architectures  used  in  this 
study.  Using  these  results,  we  can  construct  a  model  which  predicts  the  time  complexity 
of  our  model  problem.  We  also  discuss  some  global  operations  which  are  present  in  our 
model  algorithm,  and  how  they  are  handled  on  different  computer  architectures. 

We  first  establish  some  basic  ideas  concerning  two  main  cleisses  of  parallel  programming 
styles:  message  passing  and  shared  memory.  We  first  describe  these  two  approaches  and 
then  discuss  the  similarities  in  terms  of  both  expressiveness  and  of  efficiency. 

In  a  message  p2issing  computer,  memory  access  for  data  on  different  processors  is  via 
inter-processor  communication  which  is  handled  by  explicitly  sending  a  message  from  one 
processor  to  another.  There  are  two  major  parameters  of  interest.  These  are  the  message 
startup  time  a,  and  the  transfer  rate  r.  To  make  a  and  r  independent  of  particular 
hardware  implementations,  a  and  r  have  been  nondimensionalized  by  dividing  by  the 


floating  point  speed  in  flops/sec.  Thus  a  is  in  ops/startup  and  r  is  in  ops/word.  VVe  will 
denote  the  floating  point  computation  rate  by  /  flops/sec. 

For  the  tightly  coupled  computers,  memory  access  is  from  shared  memory  to  processor 
and  the  three  parameters  of  interest  are  the  memory  access  time,  the  speed  of  a  floating 
point  operation  and  the  number  of  simultaneous  memory  references.  The  programming 
constructs  used  with  shared  memory  machines  include  barriers  and  critical  regions.  A 
barrier  is  a  synchronization  point  which  all  processors  must  reach  before  any  of  them  can 
proceed.  A  critical  region  allows  only  one  process  to  access  and  modify  data  at  a  time.  In 
a  parallel  program,  barriers  may  be  used,  for  example,  to  wait  for  all  processors  to  reach 
the  end  of  a  time  step.  A  critical  region  may  be  used  to  provide  a  unique  do-loop  index 
to  each  parallel  processor.  These  are  discussed  in  more  detail  in,  e.g.,  [Andr  85]. 

Message  passing  computers  such  as  the  Intel  Hypercube  and  shared  memory  machines 
such  as  the  Encore  Multimax  are  often  considered  two  completely  different  types  of  parallel 
computer.  In  fact,  when  viewed  in  the  right  way,  they  are  very  similar.  What  distinguishes 
them  is  the  relative  cost  of  referencing  remote  or  shared  information.  The  following  table 
shows  the  correspondence  between  the  costs  of  interprocessor  communication  for  the  two 
types  of  machines. 


Message  Passing 


I/O  startup  time 


Transfer  rate 


Shared  Memory 

Cache  miss  overhead  and  time 
to  establish  critical  regions 
and  barriers 
Memory  transfer  rate 


Each  of  these  machines  have  different  strengths.  The  shared  memory  machines  have  faster 
access  to  shared  information,  but  pay  a  penalty  in  terms  of  higher  overheads  in  the  forms 
of  barriers  and  critical  regions.  In  general,  these  barriers  and  critical  regions  are  intrinsic 
to  the  multiprocess  computation  and  are  therefore  unavoidable.  Message  pcissing  machines 
have  simpler  access  control  but  at  a  higher  cost  in  sharing  data.  Barriers  and  critical  regions 
are  sometimes  used  with  message  pcissing  computers;  however,  the  style  of  programming 
normally  used  with  message  passing  makes  them  less  important. 

Consider  the  cost  of  computing  a  step  of  an  explicit  PDE  on  an  n  x  n  (in  2-d)  and  an 
n  X  n  X  n  (in  3-d)  grid.  This  step  will  usually  consist  of  an  estimate  of  the  time  step  size  to 
use,  based  on  CFL  estimates,  followed  by  the  application  of  a  local  stencil.  The  estimate  of 
the  time  step  requires  global  information  (the  solution  everywhere);  the  application  of  the 
stencil  just  local  or  neighbor  information.  We  describe  first  the  local  communication  and 
then  the  global  communication.  Since  these  effects  are  most  noticable  in  the  context  of 
message  parsing,  we  describe  them  in  those  terms.  However,  similar  considerations  apply 
to  shared  memory. 


Figure  1:  Two  different  decompositions  of  a  2-d  domain.  Each  outlined  area  is  given  to  a 
different  processor. 

The  first  step  in  any  parallel  program  is  the  division  of  work  among  the  processors. 
Divide  this  grid  among  the  processors  by  slicing  the  problem  domain  by  planes  in  1,  2, 
or  all  3  cartesian  directions.  Call  the  resulting  decomposition  1-d,  2-d,  or  3-d  slices.  An 
example  of  1-d  and  2-d  slices  for  a  2-d  domain  is  shown  in  Figure  1. 

In  each  Ccise,  it  is  the  length  of  the  boundary  of  the  slice  which  determines  the  com¬ 
munication  cost.  For  example,  in  2-d  where  the  domain  is  sliced  in  one  direction,  the 
boundaries  between  processors  are  n  mesh-points  long,  so  that  the  effort  for  one  step  is 

- h  2(a  +  rn). 

P 

(All  times  are  expressed  in  nondimensional  units  of  floating  point  operations.)  The  “2” 
comes  from  sending  the  left  internal  boundary  to  the  left  neighbor  and  the  right  internal 
boundary  to  the  right  neighbor.  While  in  principle  these  operations  could  go  on  simul¬ 
taneously,  in  practice  they  require  essentially  exclusive  use  of  the  processor  and  memory 
bus.  For  slices  in  two  directions,  there  are  now  either  4  neighbors  (for  a  5-point  stencil)  or 
8  neighbors  (for  a  9-point  stencil),  and  the  cost  becomes 

Here  Oj  to  denotes  the  cost  of  a  two  hop  link  and  is  only  present  in  the  9-point  case. 
Similar  analysis  can  be  carried  out  for  other  cases;  they  are  summarized  in  Table  1. 

From  these  formulas  it  is  clear  that  once  the  problem  is  large  enough  (n  large),  the 
communication  costs  become  negligible.  However,  in  practice  the  problems  are  neither 
enormous  nor  the  constants  the  same  size.  Thus,  the  communication  terms,  though  asymp- 


Table  1:  Times  for  various  decompositions  of  the  domain,  for  both  2  and  3-d  domains, 
is  the  cost  for  a  two-hop  link  and  0-3  the  cost  for  a  three  hop  link.  The  terms  containing 
them  are  not  present  for  5  or  7  point  stencils. 


totically  negligible,  can  be  of  dominant  importance.  For  example,  take 

p  =64 
n  =  256 

/  =1  pseconds/mesh-point 

fa  =3  milliseconds 
fr  =1  juseconds/word. 

(Recall  that  a  and  r  are  expressed  in  terms  of  the  floating  point  computation  rate  /). 
This  represents  a  fast  node  (over  1  megaflop)  with  moderate  communication  speeds  and 
has  at  at  least  2  megabytes  of  memory.  An  existing  m£u:hine  with  similar  parameters  is  the 
Intel  VX  hypercube.  In  this  case,  in  3-d  with  1-d  slicing  and  a  5-point  stencil  and  using 
the  time  estimates  in  Table  1,  the  communication  time  is  roughly  0.12  seconds  and  the 
computation  time  is  roughly  0.24  seconds.  The  communication  is  taking  one  third  of  the 
total  time.  Further,  this  is  assuming  that  the  message  requires  only  one  startup,  despite 
its  large  size  (n^  words).  If  we  also  require  all  messages  to  be  less  than  16384  bytes,  this 
adds  another  0.1  seconds  of  startup  time,  making  communication  almost  half  of  the  total 
time. 

The  analysis  so  far  has  been  adequate  for  centered  single  step  schemes  such  as  Lax- 
Wendroff  or  leapfrog.  However,  many  schemes  use  combinations  of  one-sided  stencils.  The 
MacCormick  scheme  used  in  our  model  problem  is  one  such  scheme.  In  this  case,  we 
simply  consider  ezich  access  separately.  Speciflcally,  sharing  information  at  a  boundary 
with  another  processor  takes  time 

a  -t-  rn.  (9) 

One  problem  with  considering  each  access  separately  is  that  there  may  be  synchroniza¬ 
tion  delays  caused  by  different  processors  finishing  at  different  times.  We  will  ignore  these 
effects  for  now;  however,  they  are  likely  to  become  more  important  in  large  scale  MIMD 
parallelism. 


A  certain  amount  of  global  communication  is  necessary  in  these  algorithms  as  well.  For 
example,  the  choice  of  time  step  size  requires  the  global  value  of  the  maximum  velocity.  In 
addition,  the  values  of  interest  in  the  computation  include  integrals  along  the  boundary, 
which  may  need  to  be  computed  across  processors.  These  operations  fall  into  the  general 
category  of  reductions;  taking  data  from  many  places  and  combining  it  to  produce  a  single 
result. 

Reductions  are  often  done  using  a  fast  distribution  algorithm  such  as  those  in  [Saad  85], 
with  the  operation  (maximum  or  sum)  inserted  into  the  distribution.  For  example,  a  tree- 
based  distribution  algorithm  such  as  those  described  in  [Saad  85]  is  fast  on  both  shared 
memory  and  hypercube/message  passing  computers.  In  this  case,  the  time  to  reduce  p 
items  is 

(a  +  r  -I-  /)  log  p 

assuming  simultaneous  send/receives.  One  potential  problem  here  is  possible  round-off 
error;  to  avoid  that,  we  reduce  upward  using  the  tree  to  a  single  node,  then  distribute  that 
single  value  downward  to  all  nodes,  again  using  the  tree.  The  time  is  then 

(2(q: -I- r) -I- /)  logp.  (10) 

When  using  a  small  number  of  processors  and  a  shared-memory  or  complete  intercon¬ 
nect,  it  is  often  easier  for  a  single  processor  to  compute  the  reduction,  then  make  that 
value  available  to  all  processors.  In  this  Ccise,  the  time  is  p/  plus  the  time  to  establish 
two  barriers,  one  before  the  reduction  to  insure  that  the  data  is  ready,  and  one  after  it  to 
insure  that  the  result  is  ready. 

4.3  Time  complexity  estimates  for  the  model  problem 

Our  model  problem  is  more  complicated  than  a  simple  5-point  explicit  scheme  because  it 
is  multistep  and  each  step  is  not  centered.  However,  we  can  break  it  down  into 

•  floating  point  work 

•  local  data  exchanges 

•  global  data  exchanges. 

This  form  is  suitable  to  a  program  where  the  parallelism  is  expressed  explicitly.  In  our 
particular  implementation,  we  can  write  the  total  computation  time  per  step  as 

n 

T  =  Fi - V-  Fi - h  8 “global  reduce”  +  6  (“send  in  y”  4-  “send  in  x”)  (11) 

P  Pv 

where  Fy  is  the  floating  point  time  per  mesh  point  for  operations  on  the  whole  grid,  F^ 
is  the  floating  point  time  per  mesh  point  for  operations  along  one  boundary,  the  grid  is 
n  X  n,  p  is  the  number  of  processors,  and  p^  is  the  number  of  processors  in  the  y  direction. 


Using  equations  9  and  10  as  the  communication  model,  equation  11  becomes 


T  = 


+  F2  —  +  8(a  +  r)  log  p  +  6  [  a  +  r  — 

Py  V  Px 


+  6  a  +  r  — 

V  Py 


The  memory  access  terms  uiis  equation  are  the  direct  communication  terms  involving  a 
and  r. 

For  this  local  communication  model  to  be  applicable,  it  must  be  possible  to  map  a  grid 
onto  the  parallel  processor  in  such  a  way  as  to  make  adjacent  slices  adjacent  in  the  parallel 
processor.  We  will  assume  that  any  2  or  3-d  grid  of  interest  can  be  efficiently  embedded  in 
the  parallel  processor.  Such  embeddings  are  easy  on  hypercubes;  alternatively,  everything 
we  say  here  applies  equally  well  to  a  mesh  connected  processor  of  the  correct  dimension. 

For  a  shared  memory  machine  with  sufficient  memory  bandwidth,  the  figures  are  sim¬ 
ilar,  except  the  global  reduction  is  done  differently,  and  the  overhead  of  data  sharing  is 
slightly  different.  In  this  case,  the  results  are 


T  =  Fi - 1-  F2 - 1-  16 “barriers  for  reduce”  -l-  6 “barriers  for  data”. 

P  Py 


Here,  the  “barriers”  are  synchronization  points  in  the  code.  Depending  on  the  implemen¬ 
tation,  these  can  be  order  p  or  logp.  Further,  it  is  possible  to  reduce  the  number  of  these 
by  using  barriers  with  values  [Else  87]. 

On  a  tightly  coupled  multiple  vector  processor  such  as  the  Alliant  FX/8  the  work  es¬ 
timates  are  slightly  different.  In  these  machines,  “close  coupling”  of  the  processors  allows 
the  rapid  transfer  of  information  from  one  processor  to  another,  and  extremely  fast  syn¬ 
chronization.  On  the  negative  side,  since  there  is  no  explicit  parallelism,  the  programmer 
may  be  dependent  on  the  Fortran  compiler  to  generate  parallel  code  and  it  is  possible  for 
there  to  be  substantial  amounts  of  non-parallel  code  generated.  Further,  any  shared  mem¬ 
ory  machine  suffers  from  a  potential  bottleneck  in  memory  access;  depending  on  cache 
or  register  utilization  and  the  exact  pattern  of  operations  chosen  by  the  compiler,  the 
performance  may  exhibit  anything  from  superlinear  speedup  to  a  plateau  in  performance. 

We  can  model  this  bottleneck  in  shared  resources  such  a.s  memory  as  follows.  Let  the 
computation  consist  of  two  parts;  one  which  is  arbitrarily  parallel,  and  one  which  uses  the 
shared  resource.  For  example,  consider  a  machine  with  p  processors  in  which  p  floating 
point  operations  may  be  done  in  parallel.  However,  only  k  processors  may  use  the  shared 
memory  at  any  time.  Then  the  time  to  do  a  computation  has  the  form 

T{p)  =  -  +  -.  -7--  , ,  •  (14) 

p  min(p,k) 

For  A:  =  1,  this  is  the  well-known  Amdahl’s  law.  The  speedup  predicted  for  such  a  machine 
as  p  gets  large  (p  >  k)  is 


>(P)  = 


a  -f-  6 


min(p,*) 


(a  -I-  b)kp 
ak  +  bp 


As  p  — ►  oo,  the  speedup  tends  to  (a  +  b)k/b  or 


(*  1)  *■ 


Figure  2  shows  the  behavior  of  s  for  different  values  of  k.  Note  the  approach  to  the 
asymptotic  value  of  s,  which  for  the  example  values  of  a  and  b  is  9k. 

This  “bottleneck”  effect  is  present  in  both  multiple  vector  processors  and  in  tightly- 
coupled  shared  memory  computers.  Jordan  [Jord  87]  discusses  these  effects  in  more  detail. 

We  have  not  considered  all  possible  effects.  One  important  effect  is  load  balancing. 
On  distributed  memory  machines,  the  current  approach  is  to  divide  the  problem  statically 
among  the  processors.  This  is  a  large  grain  division  of  work.  If  the  computational  work  is 
not  equally  divided  among  the  processors,  there  will  be  additional  costs  caused  by  some 
processors  going  idle  while  others  continue  to  work.  However,  this  effect  diminishes  as  the 
size  of  the  problem  relative  to  the  number  of  processors  increases. 
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5  Experimental  Results 

In  this  section  we  show  the  results  of  computing  our  model  problem  on  several  different 
types  of  architectures. 

We  ran  our  model  problem  on  three  different  parallel  computers.  An  Intel  iPSC  Hy¬ 
percube,  an  Encore  Multimax  120,  and  an  Alliant  FX/8.  The  Intel  Hypercube  is  a  loosely 
coupled  MIMD,  the  Encore  is  a  tightly  coupled  MIMD,  and  the  Alliant  is  a  multiple  vector 
processor.  Each  of  these  represents  a  different  class  of  parallel  computer. 

Figures  3,  4,  and  5  show  the  results  of  our  experiments.  We  have  computed  fits  to 
these  experiments  using  the  models  developed  in  Section  4.  These  fits  were  obtained  by  a 
non-linear  least-squares  fit  to  the  timing  data,  after  scaling  the  timing  data  by  n^/p. 

5.1  Intel  iPSC  Hypercube 

The  Intel  iPSC  Hypercube  is  an  example  of  a  loosely  coupled  MIMD  computer.  The  pro¬ 
cessors  are  connected  in  a  hypercube  architecture,  and  communicate  by  sending  messages 
over  a  dedicated  ethernet  link  (one  per  processor-pair).  Each  processor  has  0.5  megabytes 
of  memory,  and  on  our  machine,  there  is  an  additional  4.0  megabytes  of  memory  on  an 
attached  board. 

Figure  3  presents  the  results  of  these  experiments.  Each  computation  represents  10 
time  steps  over  the  grid.  A  fit  to  the  data  using  equation  12  gives 

Tl 

T  =  0.066—  +  0.014—  +  0.861  log  p  +  0.0302(nx  +  n  J  +  0.868,  (17) 

P  Pv 

where  rii  =  n/pi  and  Uy  —  n/py. 

In  Figure  3,  we  see  the  effects  of  the  uneven  distribution  of  work  among  the  processors 
in  the  stepped  increase  in  the  speedup  figures.  Each  processor  heis  some  fixed  number  of 
“slices”,  each  of  which  has  roughly  n/p,  x  n/py  mesh  points.  If  p^  or  Py  do  not  divide  n, 
then  some  processors  have  [n/p*]  x  [n/py]  and  some  have  [n/p^J  x  [n/p^J.  The  relative 
difference  is  approximately  (assuming  both  pz  and  py  don’t  divide  n) 

P»  I  Py 
n  n 

As  the  number  of  processors  approaches  n,  the  difference  can  be  very  large.  The  points  in 
p  where  there  are  jumps  in  the  speedup  for  the  1-d  decomposition  are  just  those  values  of 
p  which  divide  n,  as  can  be  seen  in  Figure  3.  Note  that  the  hypercube  interconnections  are 
not  of  major  importance;  a  2-d  mesh  interconnection  would  give  almost  the  same  results. 

5.2  Encore  Multimax 

The  Encore  Multimax  is  an  example  of  a  shared  memory  MIMD  computer.  There  are 
two  processors  per  board,  along  with  a  cache  memory.  These  boards  are  connected  to 
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Figure  3;  Speed  up  figures  for  the  Intel  iPSC  hypercube.  The  graphs  show  fits  as  a 
function  of  decomposition  at  a  fixed  problem  size,  and  as  a  function  of  problem  size  for 
fixed  decomposition.  The  same  fit  parameters  were  used  in  all  graphs,  and  are  drawn  as 


curves. 


a  very  fast  bus,  on  which  the  shared  memory  resides.  The  computer  runs  Unix”^^  and 
provides  a  time-sharing  environment.  One  problem  when  doing  timing  measurements  in 
a  multi-user  environment  is  that  any  parallel  job  is  competing  for  processors  with  other 
users,  mailer  daemons,  etc.  Timings  are  thus  somewhat  erratic.  Further,  they  don’t  reflect 
synchronization  problems  well,  since  idle  processes  won’t  use  CPU  time,  and  “real  time” 
measurements  are  even  more  inaccurate  on  a  time-sharing  system.  Thus,  our  results  shown 
in  Figure  4  are  at  best  approximate.  However,  they  do  match  our  speedup  estimate  very 
well  with  parameters  in  equation  18,  which  is  a  combination  of  equation  13  and  equation  14. 

2  2 

T  =  0.059—  -h  0.0080~-r-^^ - r  -I-  0.08—  -H  0.29p.  (18) 

p  mm(p,  10)  Py 

These  parameters  give  a  very  good  fit  to  our  experimental  data  as  shown  in  Figure  4. 

Note  that  since  the  onset  of  the  “bottleneck”  term  is  so  high,  this  may  be  due  more  to 
the  availability  of  processors  than  to  limitations  in  the  hardware.  In  fact,  the  Encore  bus 
is  very  fast  relative  to  the  speed  of  the  processors,  and  we  do  not  expect  to  see  a  significant 
degradation  for  the  number  of  processors  available. 

5.3  Alliant  FX/8 

The  Alliant  FX/8  is  an  example  of  a  multiple  vector  processor  computer.  It  has  up  to  eight 
“computational  elements”,  each  of  which  is  a  vector  processor.  These  processors  share  two 
high-speed  caches,  which  in  turn  have  a  high  speed  channel  to  memory.  It  is  possible  for 
the  processors  to  exceed  speed  of  the  cache. 

Computations  were  done  for  51  time  steps  and  mesh  sizes  of  n  =  50,  100,  150,  200,  and 
250  on  an  eight  processor  Alliant  FX/8.  Figure  5  shows  the  speedups  observed  for  various 
problem  sizes  compared  to  a  fit  against  Equation  14.  The  fit  used  is 

2  2 

T  ^  0.00645—  -I- 0.00085 — — , 
p  min(p, «) 

where  k  =  2.45.  The  fit  is  insensitive  to  k  in  the  range  2  <  k  <  3. 

The  apparent  super-linear  speedup  for  two  processors  shown  by  the  curves  in  Figure  5 
is  an  artifact  of  the  curve  fit,  caused  by  the  rapid  turn-down  in  speedup  at  three  processors. 

The  data  in  Figure  5  show  a  dropoff  below  linear  speedup  as  the  number  of  processors 
increases.  This  dropoff  is  due  in  part  to  a  memory  bottleneck.  The  memory  bottleneck 
occurs  when  many  processors  require  the  same  data  from  memory.  In  the  Alliant  FX/8  each 
of  the  four  quadrants  of  the  fast  cache  memory  can  be  accessed  by  any  two  computational 
elements  (processors)  via  a  crossbar  interconnect.  When  more  than  two  computational 
elements  try  to  access  the  same  data  there  is  “cache  contention”  which  is  a  form  of  memory 
bottleneck.  In  addition  cache  and  main  memory  are  updated  via  hardware  to  maintain 
memory  system  consistency.  Thus  if  more  cache  memories  were  added  there  would  be 


Figure  4:  Speedup  figures  for  the  Encore 

Multimax.  The  fit  uses  T  —  589.5/p  -f  79.5/  min(p,  10)  +  8.22/py  -I-  .29p.  The  problem 
size  is  n  =  100.  The  line  labels  are  the  values  of  Py  used  for  that  curve.  The  solid  line 
represents  a  perfect  speedup.  Note  the  sharper  deviation  from  perfect  speedup  at  roughly 
10  processors. 


speedup  on  flliiant  FX/8 


Figure  5;  Results  for  the  Alliant  FX/8.  The  fit  uses 

T  =  0.00645n*/p  +  0.00085n^/ min(p, 2.45).  The  speedup  is  almost  independent  of  the 
problem  size.  The  solid  line  represents  a  perfect  speedup;  note  the  deviation  from  this  line 
by  both  the  experimental  results  and  the  theory  at  around  three  processors. 


a  memory  bottleneck  as  cache  and  main  memories  are  updated  to  maintain  consistency. 
Another  part  is  due  to  some  of  the  program  running  in  serial  (using  only  a  single  processor); 
this  is  due  in  part  to  the  Fortran  compiler  on  the  Alliant,  in  a  deliberate  trade-off  between 
ease  of  use  and  maximum  parallelism. 

Memory  bottleneck  will  occur  in  any  shared  memory  parallel  computer.  Since  memory 
bottleneck  will  cause  a  falloff  in  the  speedup  curve  below  linear  speedup,  it  is  a  very 
significant  limitation  on  shared  memory  parallel  computers  for  massive  parallelism. 

5.4  Summary  of  experimental  results 

The  agreement  between  the  experiments  and  the  complexity  estimates  is  very  good.  The 
complexity  estimates  capture  the  important  contributions  to  the  computation  time.  Thus 
we  can  use  the  complexity  results  to  predict  performance  for  differing  values  of  the  param¬ 
eters,  including  the  number  of  processors. 

In  the  case  of  a  distributed  memory  parallel  computer,  such  as  the  Intel  Hypercube, 
the  time  estimate  for  the  computation  has  the  form  shown  in  equation  12.  For  simplicity, 
we  will  consider  the  case  Then  the  estimate  can  be  simplified  to 


T  =  Fi - h  Fj—  -h  8(a  -h  r)  logp  12  a  +  r-—  1  . 

P  Vp  \  VpJ 

Because  of  the  logp  term  there  is  a  value  of  p  beyond  which  adding  additional  processors 
actually  slows  the  computation  down.  The  maximum  speed  up  is  achieved  when  dT / dp  = 
0,  and  is  at 

IF2  +  6r  -f-  \/(^F2  -I-  6r)2  -|-  32Fi(a  +  r) 

=  " - • 

From  this  we  can  see  that  the  the  optimal  number  of  processors  depends  on  n.  It  can  be 
shown  that,  as  n  — ♦  00  and  with  this  optimal  p,  the  speedup  tends  to 


Thus,  the  speedup  for  the  distributed  memory  parallel  computer  (e.g.,  Intel  Hypercube) 
increases  without  bound  £is  the  problem  size  and  number  of  processors  grow. 

For  a  shared  memory  parallel  computer,  such  as  the  Alliant  FX/8  or  Encore  Multimax, 
the  time  estimate  for  the  computation  has  the  form  shown  in  equation  14.  (Note  that 
the  shape  of  the  Encore  and  Alliant  results  are  very  similar;  the  terms  corresponding  to 
equation  14  in  both  caises  are  the  dominant  terms.)  The  predicted  speedup  is  independent 
of  the  problem  size,  as  we  observed  experimentally.  The  parameters  determined  by  our 
experiments  give  a  maximum  possible  speedup  of  roughly  20  for  the  Alliant  and  roughly 
80  for  the  Encore.  Of  course,  the  number  of  processors  available  on  both  the  Alliant  and 
the  Encore  were  picked  to  match  the  available  memory  bandwidths.  Other  technologies 
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0.00066  0.00014  0.861 

0.0132  0.0028  0.215 

0.0033  0.0007  0.215 

0.00066  0.00014  0.215 


0.868 

0.868 

0.217 

0.217 

0.217 


0.00132  I  0.00028  I  0.0172  I  0.0176 


0.0302 
0.0302 
0.0302 
0.0302 
0.0302 
0.00755 


Table  2;  Parameters  for  some  possible  distributed  memory  parallel  processors. 


and  designs  would  allow  larger  limits.  The  limits  themselves  are,  however,  intrinsic  in 
any  shared  resource  design.  Thus,  because  of  the  bottlenecks  in  the  shared  memory,  even 
with  an  infinite  number  of  processors,  the  maximum  speedup  of  a  shared  memory  parallel 
computer  is  limited. 

Comparing  these  results,  we  see  that  the  distributed  memory  computers  offer  more 
potential  speedup  than  shared  memory  computers  the  number  of  available  processors 
increzises. 

6  Predictions  and  Observations 

As  an  example  of  the  predictive  use  of  our  complexity  estimates,  we  can  estimate  the 
performance  of  a  distributed  memory  parallel  processor,  starting  with  our  results  for  the 
Intel  Hypercube.  The  relevant  formula  is  equation  17,  written  as 

n2  ^ 

T  =  fi - 1-  /a—  +  Cl  logp  +  C3(n,  +  n^)  +  C2. 

P  Py 

Terms  fi  and  /2  depend  only  on  the  floating  point  speed.  Cj  and  C2  depend  on  the  com¬ 
munication  startup  times,  and  on  the  communication  transfer  rate  (ci  also  depends  on 
floating  point  speed  and  transfer  rate;  however,  startup  times  will  usually  dominate  Ci). 

Table  2  shows  possible  parameters  for  some  distributed  memory  hypercubes.  All  of 
these  estimates  are  drawn  from  machines  which  are  either  available  today  or  planned 
for  the  near  future.  Machine  (a)  is  the  Intel  hypercube  used  in  our  tests.  Machine  (b) 
represents  a  machine  with  4  megaflop  vector  boards.  Machine  (c)  is  a  more  balanced 
machine,  with  both  floating  point  and  communication  startup  times  reduced  through  a 
faster  processor  (such  as  the  Intel  80386).  Machine  (d)  is  (c),  but  with  four  times  as  fast 
floating  point.  This  might  come  from  using  a  non-vector/pipelined  feist  floating  point  chip. 
Machine  (e)  is  (c),  using  the  vector  boards  from  (b).  Finally,  Machine  (f)  uses  high  speed 
combined  communications  and  computation  hardware,  such  eis  in  the  INMOS  Transputer 
chips. 


si 


Table  3:  Predicted  times  and  speedups  for  various  distributed  memory  parallel  processors. 
Times  are  in  the  column  T  and  speedups  are  in  the  column  S. 


In  Table  3  we  show  the  predicted  performance  of  a  number  of  possible  distributed 
memory  parallel  computers  described  in  Table  2. 

The  relatively  poor  showing  of  machines  (b)  and  (e)  is  due  to  the  large  communication 
terms,  and  suggests  changes  to  both  the  algorithm  (to  reduce  data  exchanges)  and  the 
hardware  (to  make  communication  speeds  comparable  to  floating  point  speeds).  The 
figures  for  Machine  (f)  show  that  if  communication  speeds  (particularly  startups,  the  “a” 
term)  are  increased,  very  respectable  speedups  can  be  obtained,  even  for  large  number  of 
processors.  Similar  calculations  can  be  performed  for  proposed  shared  memory  machines. 

It  is  also  interesting  to  compare  the  amount  of  programming  work  needed  to  run  our 
experiments.  In  no  case  was  much  effort  required  to  modify  the  original  serial  code  to  run 
on  any  of  the  parallel  machines.  The  easiest  to  run  was  the  Alliant,  since  the  Fortran  com¬ 
piler  on  the  Alliant  automatically  makes  reasonably  good  use  of  the  parallelism.  Neither 
the  Encore  nor  the  Intel  machines  have  such  compilers,  and  each  required  a  small  amount 
of  special  coding.  In  addition,  the  Allitint  required  some  special  coding  to  improve  the 
performance.  The  overall  efforts  were  similar. 


7  Conclusions 


We  have  shown  that  significant  parallelism  is  available  at  least  for  explicit  methods  for 
CFD.  Our  complexity  estimates,  validated  by  our  experiments,  show  speedups  limited 
only  by  the  problem  size  for  distributed  memory  architectures.  For  shared  memory  archi¬ 
tectures,  bottlenecks  at  the  shared  resource  constrain  the  available  speedups. 

In  actual  practice,  our  results  show  that  the  communications  costs  associated  with 
distributed  memory  parallel  computers  can  seriously  degrade  the  performance  of  these 
systems.  Commercial  shared  memory  computers  such  as  the  Alliant  FX/8  have  achieved 
a  better  balance  of  system  components,  at  the  (recognized)  cost  of  limited  parallelism. 
Since  only  massive  parallelism  has  the  promise  of  orders  of  magnitude  increases  in  speed, 


we  hope  that  our  results  and  ones  like  them  will  stimulate  both  algorithm  and  hardware 
designers  to  overcome  these  practical  problems. 

In  this  paper  we  have  considered  shared  memory  and  distributed  memory  parallel 
computers.  In  reality,  actual  computers  are  rarely  so  easily  categorized.  For  example, 
a  computer  presenting  the  programmer  with  a  shared  memory  model  but  implemented 
in  hardware  with  several  levels  of  distributed  memory  (a  hierarchical  memory)  offers  an 
intermediate  form  of  parallel  computer.  The  significance  of  our  results  is  that,  for  signifi¬ 
cant  parallelism,  use  of  any  shared  resource  that  presents  a  bottleneck  must  be  extremely 
limited.  The  success  of  the  distributed  memory  code  shows  that  explicit  CFD  codes  can 
be  easily  written  in  such  a  form  which  avoids  these  bottlenecks. 
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